[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802

StanHatko · 2025-02-06T02:59:02Z

Fixes nondeterminism from the sampler that occurs even at zero temperature. The patch supports both zero and nonzero temperatures for different entries in the same batch, using the correct operation in each case. It does this by multiplying by an is-zero indicator for the zero temperature calculation and is-nonzero indicator for positive temperature case. These two are then summed. For any entry, only one term in the sum is nonzero as the two cases are mutually exclusive.

I created this patch and discussed it in #5404 (comment). I used this patch internally and it greatly reduced variation in answers at zero temperature (remaining nondeterminism I think is due to CUDA nondeterminacy in the steps leading up to calculation of the logits).

Please feel free to make any improvements to this patch you can think of.

FIX #5404 (patch discussed there).

FIX #5898 (part from sampler, nondeterminism from the computation of the logits themselves not covered here).

supports both zero and nonzero temperatures in batch see vllm-project#5404 (comment) Signed-off-by: Stan Hatko <stan_hatko@live.com>

github-actions · 2025-02-06T02:59:16Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Stan Hatko <stan_hatko@live.com>

ywang96

Very much appreciated the investigation! IMO this is quite an elegant patch - @WoosukKwon What do you think?

StanHatko · 2025-02-09T14:18:20Z

For the pre-commit hook error here, it shows missing SPDX headers in the file tests/lora/test_ultravox.py which I didn't modify at all in my commits. The pre-commit hooks passed on my computer, so I think that was unrelated commits by others to vLLM since.

For the test with increased GPU VRAM usage, some increase makes sense due to the creation of additional tensors. The previous code just did an in-place divide after the conversion to float32, while the current code needs to compute tensors for both cases and sum them. I think correctness of outputs at zero temperature (a very common use case) is more important though than saving GPU VRAM at the sampling step. Also it may be possible to optimize the code to do more stuff in-place, but some new tensors are inevitable I think with doing it this way.

However, it's possible to avoid the increased GPU RAM usage with a custom CUDA kernel. Basically re-write this as a CUDA kernel and call that. No new tensors will be needed as the computations will be done in temporary variables within a loop, with the final output values assigned directly to the output tensor. I think we should first merge this though and then implement the CUDA kernel as an optimization. I have no experience writing CUDA kernels (though I did write and optimize OpenCL code many years ago).

WoosukKwon · 2025-02-11T18:29:15Z

Wow this is amazing! Thanks for the thorough investigation!

WoosukKwon · 2025-02-14T07:21:35Z

Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.

njhill · 2025-02-14T23:50:03Z

@StanHatko thank you for the investigation.

To me this looks like a workaround to an underlying problem of doing argmax on the logprobs for greedy.

Could you try out this PR which fixes that? #13312

At a minimum I think this should avoid the need for the * 1e9 * (logits == logits.max(dim=1, keepdim=True)[0]) part, i.e. instead just keep

   logits_z = is_zero * logits

Man I forgot how convoluted our v0 sampler code was 😅

StanHatko · 2025-02-15T01:37:40Z

Yes fixing the sampler to be deterministic in the temperature = 0 case seems like a better solution, ideally by just taking the largest logit in that case everything here can be avoided. I'll check if that pull request fixes the issue.

StanHatko · 2025-02-17T18:25:46Z

Unfortunately I'm having trouble reproducing some of the old non-deterministic runs from before. There were a bunch of changes since and the prompts here were long, complex, and had data that can't be released publicly. With simple prompts I didn't see the non-determinism in some basic tests. If anyone has a minimum working example please post it.

Some performance optimizations for throughput I enabled (in particular enable prefix caching, sorting before feeding in to take advantage of prefix caching, enable prefill chunking, and CUDA graph optimization) together seemed to reduce the amount of non-determinism (went down from 3, 4, 1, 2, 3 unique responses for 5 different queries each repeated 100 times down to 1, 2, 1, 1, 2).

The new PR looks good. I agree the sampler itself is the better place to make the changes, the reason I didn't do that myself is that I wasn't familiar with the sampler code but understood the logits and logprobs code. For the greedy sampling, is it guaranteed the flag will be set for zero-temperature entries? If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?

njhill · 2025-02-17T19:04:47Z

Thanks @StanHatko

If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?

@StanHatko yes that's correct, and greedy requests are determined by temperature < _SAMPLING_EPS.

mergify · 2025-05-11T08:02:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @StanHatko.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

fix zero temperature case

820371f

supports both zero and nonzero temperatures in batch see vllm-project#5404 (comment) Signed-off-by: Stan Hatko <stan_hatko@live.com>

StanHatko requested review from zhuohan123, youkaichao, alexm-redhat, comaniac and njhill as code owners February 6, 2025 02:59

adjust so passes pre-commit hooks

3069a81

Signed-off-by: Stan Hatko <stan_hatko@live.com>

ywang96 reviewed Feb 7, 2025

View reviewed changes

Muhtasham approved these changes Feb 10, 2025

View reviewed changes

psyhtest mentioned this pull request Feb 18, 2025

Llama3.1-405B non-zero temperature mlcommons/inference#2116

Open

mergify bot added the needs-rebase label May 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802

[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802

StanHatko commented Feb 6, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Feb 6, 2025

Uh oh!

ywang96 left a comment

Uh oh!

StanHatko commented Feb 9, 2025

Uh oh!

WoosukKwon commented Feb 11, 2025

Uh oh!

WoosukKwon commented Feb 14, 2025

Uh oh!

njhill commented Feb 14, 2025 •

edited

Loading

Uh oh!

StanHatko commented Feb 15, 2025

Uh oh!

StanHatko commented Feb 17, 2025

Uh oh!

njhill commented Feb 17, 2025

Uh oh!

mergify bot commented May 11, 2025

Uh oh!

Uh oh!

Uh oh!

[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802

Are you sure you want to change the base?

[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802

Conversation

StanHatko commented Feb 6, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Feb 6, 2025

Uh oh!

ywang96 left a comment

Choose a reason for hiding this comment

Uh oh!

StanHatko commented Feb 9, 2025

Uh oh!

WoosukKwon commented Feb 11, 2025

Uh oh!

WoosukKwon commented Feb 14, 2025

Uh oh!

njhill commented Feb 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

StanHatko commented Feb 15, 2025

Uh oh!

StanHatko commented Feb 17, 2025

Uh oh!

njhill commented Feb 17, 2025

Uh oh!

mergify bot commented May 11, 2025

Uh oh!

Uh oh!

StanHatko commented Feb 6, 2025 •

edited by github-actions bot

Loading

njhill commented Feb 14, 2025 •

edited

Loading